A method to predict functional
residues in proteins

Georg Casari, Chris Sander and Alfonso Valencia 



The biological activity of a protein typically depends on the presence of a small 
number of functional residues. Identifying these residues from the amino acid 
sequences alone would be useful. Classically, strictly conserved residues are 
predicted to be functional but often conservation patterns are more complicated. 
Here, we present a novel method that exploits such patterns for the prediction of 
functional residues. The method uses a simple but powerful representation of entire 
proteins, as well as sequence residues as vectors in a generalised 'sequence space'. 
Projection of these vectors onto a lower-dimensional space reveals groups of 
residues specific for particular subfamilies that are predicted to be directly involved 
in protein function. Based on the method we present testable predictions for sets of 
functional residues in SH2 domains and in the conserved box of cyclins. 



EMBL-Heidelberg, D-69012 Heidelberg, Germany

Biological sequence data are accumulating rapidly as a result of advanced
sequencing technology and concerted genome projects. The probability that a
new protein can be classified as a member of a sequence family is already
near 50% (ref. 1). The more members of a family are known, the more we begin
to learn about the evolutionary constraints that conserve residues or their
properties at particular sequence positions. Evolutionary constraints are
imposed by requirements of three-dimensional structure and of biological
function. In general, functional requirements are known to be more pronounced
in terms of residue identities than structural constraints: completely
conserved residues in a dispersed protein family usually have a direct role
in function. For example, the conserved Ser-His-Asp triad of serine
proteinases performs the key steps in catalysis; similarly, the conserved Asn
of the Asn-Lys-X-Asp motif of G-domains makes a specific pair of hydrogen
bonds to the guanine base of bound GTP or GDP. Given a multiple sequence
alignment, it is generally straightforward to spot the most conserved residues
and predict their involvement in function. Mutation of such residues typically
causes loss of protein function.

Sequence conservation is less obvious for residues that modulate the
specificity of biological function. Such residues change as a protein evolves
to satisfy modified functional constraints, while the basic biochemical
mechanism and the overall three-dimensional fold remain unaltered. For
example, the difference in peptide cleavage specificity between trypsin and
chymotrypsin is achieved by residues of the required chemical type in the
specificity pocket near the active site. Evolutionary changes in specificity
tend to occur in jumps; that is, residues that determine specificity are
conserved within a subfamily of proteins, but differ between subfamilies. This
stepwise behaviour is consistent with an evolutionary scenario in which
functional requirements change rather sharply with a change in specificity and
remain constant thereafter. In general, recognition of functional subfamilies
by their characteristic residues is not easily done by mere inspection of a
multiple sequence alignment.

Here we describe a straightforward and powerful new method to identify
residues that are likely to be responsible for functional differences between
protein subfamilies. The approach requires only a multiple sequence alignment
as input and provides an experimentally testable prediction in the form of
likely functional residues. The analysis is illustrated with examples in which
there is a proven correlation between prediction and experiment. We show
testable predictions derived from this type of sequence analysis for two
protein families of biological interest where experimental evidence is
incomplete or not yet available.

The method is based on analysis of protein multiple sequence alignments, which
can be generated using well-tested algorithms (ref. 2). Typically, the
sequences are retrieved by scanning the protein sequence databases for
homologues of a search sequence. Ideally, the family of homologues has several
divergent members rather than almost identical sequences. As a rule of thumb,
one includes pairs with fewer than 50% identical residues. The quality of the
multiple alignment is crucial, as all further results depend on it.
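
The 50% rule of thumb can be checked mechanically once an alignment is
available. The following Python sketch is illustrative only: the function
names `percent_identity` and `divergent_pairs`, the toy sequences, and the
choice to skip columns containing a gap character (`-`) are assumptions and
not part of the published method.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def divergent_pairs(alignment: Dict[str, str],
                    threshold: float = 50.0) -> List[Tuple[str, str]]:
    """Return the pairs of sequence names that share fewer than `threshold` % identities."""
    names = list(alignment)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if percent_identity(alignment[names[i]], alignment[names[j]]) < threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Hypothetical toy alignment (not taken from the paper).
toy = {"s1": "ACDEFGHIKL", "s2": "ACDEFGHIKM", "s3": "ACNWYGRTTL"}
pairs = divergent_pairs(toy)
```

A family in which `divergent_pairs` returns few or no pairs consists mostly of
near-identical sequences and, by the rule of thumb above, is unlikely to give
an informative analysis.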
Fig. 1 Sequence space analysis of the Ras-Rab-Rho superfamily. Two projections
of the Ras-Rab-Rho superfamily defined by the three principal axes with
largest eigenvalues, x1, x2 and x3. Proteins (open circles) are shown in the
left hand plots (a,c); positions of single sequence residues (diamonds)
projected onto the same planes are shown in the corresponding right hand plots
(b,d). Projection of a, proteins and b, residues, onto a plane containing the
direction of the consensus sequence pattern for the entire Ras-Rab-Rho
superfamily (vertical) and a Ras specific direction (horizontal). In (a),
proteins more representative of the Ras subfamily are farther to the left. In
(b), single residues completely conserved in all three families occupy the
extreme top corner (tricoloured), all of which are involved in GTP-binding or
hydrolysis. Residues specific for the Ras subfamily form a corner in the
Ras-specific direction (left, green) and all residues conserved in the Ras
family (specific or not) occupy the upper left edge (light green band) in this
representation. c,d, Projection onto a plane containing the same Ras specific
direction (horizontal) and a discriminating direction for Rho (vertical). In
this representation all three protein families form separate clusters. Ycr7, a
yeast protein (red), is not a member of any of these clusters and most likely
the only representative of a new functional [...] the corresponding families
and specific for the single families (highlighted in two colours) [...].
e, Evolutionary trees of the Ras-like proteins. The upper tree has been
obtained from comparisons of complete sequences, [...] from the residues
picked as corners from d (black diamonds). Classification into subfamilies
becomes clearer [...]. f, The Ras-specific residues predicted to be crucial
for function (highlighted green) in the known structure [...] the surface
region of switch II (around α-helix 2; marked α2) and switch I (effector loop)
known to interact with [...] nucleotide exchange factor. g, next page. The
conservation pattern in the switch II region evident in a selected [...]
illustrates the essence of the method in a simple case. Some conserved
residues involved in [...] subgroups of Ras, Rab and Rho, obtained from this
analysis.




A vectorial representation 

The novelty of the approach comes from a mathematically convenient
representation that allows grouping of protein families and identification of
characteristic sequence patterns in a unified fashion. We represent each
sequence as a vector point in a multi-dimensional space (sequence space), with
residue positions and residue types as the basic dimensions. The formalism of
principal component analysis (ref. 3) can then be applied to determine the
directions in sequence space most strongly populated by the proteins in the
family. Although the visual representation of proteins in this subspace
appears similar to previous methods using multivariate statistics for
low-dimensional representation of protein families (ref. 4), the underlying
mathematical concepts are very different. These fundamental differences enable
us not only to define the protein subfamilies, as do other methods, but also,
at the same time, to trace the principal components back to the individual
residues and positions that are characteristic of the different subfamilies.

The geometric origin of sequence space as defined here is the central point of
reference. Relative to the origin, both direction and length of the vectors
have a biological meaning. Directions in sequence space represent specific
sequence patterns (profiles), combinations of specific residue types at
specific sequence positions. The directions of the principal axes can be
interpreted as the sequence patterns that best discriminate between members of
the protein family. Typically, the direction of the first principal axis
(largest eigenvalue) corresponds to the consensus pattern of the entire
family. Preferred directions in this space reveal which residue types at which
sequence positions best distinguish a subfamily. The resulting concept is
simple: proteins of a particular subfamily as well as residues characteristic
for the function of this subfamily point to the same direction in sequence
space.




[Fig. 1g: multiple sequence alignment of the region spanning β2, β3, α2, β4
and α3 for rash_human, rapa_human and rsr1_yeast (Ras); sec4_yeast, rab5_canfa
and rab7_canfa (Rab); rhoa_human, rac1_human and cc42_yeast (Rho); and
ycr7_yeast, with marker rows for Ras-specific, Rab-specific, Rho-specific and
conserved positions.]

The lengths of vectors represent the degree of conservation. A protein more
distant from the origin (Fig. 1a) is more representative in that it contains a
larger fraction of residues characteristic for this direction (pattern) in the
subspace. Similarly, the most strongly conserved residues take the most
distant positions (Fig. 1b) and form edges of the region occupied by all
residues. Residues conserved in only one subfamily form the distant edge in
the direction of this subfamily. Residues conserved in two subfamilies occupy
the corner where the two corresponding edges meet (Fig. 1d). Clear clusters of
residues on these corners and edges can reflect strong evolutionary selection
and their member residues are predicted to be directly involved in function.
In short, by representing both single residues and proteins in the principal
dimensions of sequence space, protein subfamilies are evident as clusters and
characteristic residues as corners and edges.

Fig. 2 Analysis of SH2 domain sequences for Src-like specificity. a, SH2
domains projected onto the plane defined by the first two principal axes (each
circle represents an entire domain). The plane in sequence space defined by
the first two principal axes contains a direction specific for the common
consensus of all SH2s (arrow) and a direction specific for Src homologous
sequences (Src, Fyn, Lyn, Yes; ellipsoid). b, Projection of residues onto the
same plane, used as the basis for prediction of Src-specific residues (each
diamond represents a residue type at a specific position). Single residues
conserved in all SH2 domains are identified at the extreme point of the
consensus direction (blue circle: W7, [...], L21, [...], S46, H61, [...], F80
and I86 in 1SHA), while residues unique to the cluster of Src homologous
kinases are identified at the extreme point of the Src specific direction (red
circle: [...], L67, [...], C98, [...]). Residues predicted to be functional
occupy the edge of Src conservation between these extreme points (light red
band: E37, [...], G70, Y73, [...]). The underlined residues are known to make
specific contacts in crystal structures of SH2 domains complexed with peptide
substrates (only one known contact residue is missed here) (ref. 18). The
other seven residues constitute a genuine prediction of functional residues.
c, Mapping of predicted functional residues (red) onto the known 3D structure
of the v-Src SH2 domain, used as basis for independent verification of the
prediction. Evidence in support of the prediction comes from the fact that
most of the residues not yet known to be functional extend the surface patch
of those residues already known to be in contact with the bound
phosphotyrosyl peptide (white).

Biologically interesting directions in sequence space can also be defined by
exploiting a priori experimental knowledge, rather than by principal component
analysis of sequence data alone. In this case functional axes that separate
proteins according to their known function are defined manually, assigning +1
for proteins of the functional class and -1 for nonfunctional proteins.
Projection of residues onto these external axes suggests [...] that was
externally defined (not shown).
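
Projection onto an externally defined axis can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions: the function
name `external_axis_scores`, the toy alignment and labels, and the use of an
unnormalised sum of the +1/-1 labels as the residue score are all
hypothetical choices, since the source does not show this analysis.

```python
from typing import Dict, List, Tuple


def external_axis_scores(alignment: List[str],
                         labels: List[int]) -> Dict[Tuple[int, str], int]:
    """Score each (position, residue type) by summing the labels of the
    sequences that carry that residue type at that position.

    labels[k] is +1 if sequence k belongs to the functional class, -1 otherwise.
    """
    if len(alignment) != len(labels):
        raise ValueError("one label per sequence is required")
    scores: Dict[Tuple[int, str], int] = {}
    for seq, label in zip(alignment, labels):
        for pos, res in enumerate(seq):
            if res == "-":
                continue  # assumption: gaps carry no information
            scores[(pos, res)] = scores.get((pos, res), 0) + label
    return scores


# Hypothetical toy alignment: two functional and two nonfunctional sequences.
toy_alignment = ["GAKT", "GAKS", "GSRT", "GSRS"]
toy_labels = [+1, +1, -1, -1]
scores = external_axis_scores(toy_alignment, toy_labels)
```

In this toy case the invariant G at position 0 scores 0, whereas A and K
(found only in the functional class) score +2 and S at position 1 and R score
-2, so residues that track the external classification move to the extremes
of the axis.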

Functional residues in Ras-like proteins

[...] Ras-Rab-Rho family of small GTPases, for which much experimental
information about functional residues is available, permitting direct
evaluation of the accuracy of prediction. This family has a simple
phylogenetic organisation (Fig. 1e, top): Ras proteins, involved in signal
transduction, including the human p21 Ras protooncogene protein; Rab proteins,
involved in specific targeting of vesicles within the cell; and Rho proteins,
involved in [...] bind and hydrolyse GTP. On hydrolysis, a major
conformational change [...] α2 (switch II) that acts as a signal for other
cellular proteins [...].

Applying sequence space analysis to a multiple sequence alignment of [...]
member proteins (entry 5p21.hssp in database HSSP (ref. 5)), the first three
principal axes (x1, x2 and x3 in Fig. 1) define the most [...] directions. Let
us concentrate on the directions defined [...]. The first axis (x1) points in
the direction of a sequence pattern [...] proteins (tricoloured corner at the
top of Fig. 1b). These completely conserved residues map to the GTP/GDP
binding site (ref. 6) in the three-dimensional structure of Ras p21 and almost
all of them are known to be involved either in nucleotide binding or in
catalysis.

Fig. 3 Genuine predictions of functional residues for cyclins. Schematic
representation of the cyclin box with conserved, G2/M specific and B-type
specific residues. Predicted α-helices are depicted as rods, β-strands as
arrows. Sequence positions are according to the human sequence of B1 cyclin
(cgb1_human). Conserved residues are inserted into the schema (R202, W208,
E221, D231, R232, Q245, L246, G248, A265, K267, E270, E286, L290, L293 and
F305). The second row marks positions of G2/M specific residues (M201, L205,
V212, F216, I218, T222, V247, Y258, E259, Y277, M285, P300 and R308) and in
the third row B-cyclin specific residues are indicated ([...]220, T250, D268,
L310, H304 and R307).

The second axis (x2) distinguishes the Ras subfamily from other members of the
superfamily and defines a Ras-specific direction in sequence space (Fig.
1a,c). Residues conserved in the Ras family form an edge (green band in Fig.
1b) that runs from the cluster of Ras-Rab-Rho conserved residues (top in Fig.
1b) in the direction of Ras (left in Fig. 1b). The more closely the vector of
a residue points in the direction of the Ras cluster, the more specific it is
for Ras proteins. The high conservation of all residues along this edge is
most likely the result of strong evolutionary selection and these residues are
predicted to be involved in function.

The direction of the third principal axis (x3) primarily separates Rho
proteins from Ras and Rab. Rab, Ras and Rho subfamilies are best distinguished
in this plane of axes x2 and x3 (Fig. 1c) and form clear clusters. Single
residues represented in this plane (Fig. 1d) are spread out according to their
specificity for either Ras, Rab or Rho. Residues at the end of bisectors,
where two edges meet, have dual specificity. For example, those at the lower
left in Fig. 1d (green/orange: Tyr 40, Arg 68, Pro 34) are conserved in Ras
and Rho, but not in Rab proteins. They are likely to be responsible for a
functional aspect, defined at the single residue level, common to both Rho and
Ras. An equivalent example is position 71. A Ser at this position is specific
for Rho proteins. Rab and Ras share a Tyr, a case of dual specificity (Fig.
1g).

Supporting evidence for our interpretation of sequence space comes from an
alternative analytical method and from experiment. When an evolutionary tree
is constructed using only the subset of residues at the three specificity
corners (Ras, Rab, Rho corners of Fig. 1d), the structure of the tree
simplifies greatly (Fig. 1e). The principal evolutionary event associated with
the simplified tree is the functional differentiation of a primordial protein
into the Ras, Rab and Rho functionality. This leads to an alternative
definition of specificity residues as the subset resulting in the sharpest
evolutionary tree. The sequence space analysis method efficiently determines
such a residue subset in a simple one-step procedure.

nature structural biology volume 2 number 2 february 1995



Experimental confirmation of functional predictions can best be illustrated by
mapping the Ras-specific residues onto the three-dimensional structure of Ras
p21. Remarkably, these residues map to the switch I and switch II regions
(Fig. 1f) known to be important for Ras function (ref. 7). These regions make
interactions with GAP, the GTPase activating protein, and the nucleotide
exchange factor that induces reloading of GTP. Switch I covers the end of
helix α1 and the effector loop, where we identify Ala 18, Thr 20, Gln 25, Asp
33 and Pro 34. Switch II participates in signal readout: in elongation factor
Tu, a remote homologue of Ras p21 (ref. 10), a strong conformational change in
helix α2 (switch II) as a result of GTP hydrolysis alters an exposed surface
patch that strongly affects the interaction with other domains of EF-Tu (refs
11-13). By close analogy, the Ras GTPase switch is read out by its molecular
partners interacting with a similar surface patch. The patch is predicted here
to include residues Met 67, Arg 68, Tyr 71 and Gly 75 on helix α2; Val 103,
Lys 104 and Pro 110 on helix α3; and Glu 37, Ser 39, Tyr 40, Arg 41 and Leu 56
on strands β2 and β3.



Functional residues in SH2 domains 

A case illustrating the difficulties of visual analysis of multiple alignments
is the group of Src homology 2 domains (SH2). This family has many subgroups
and few sequences in each subgroup. SH2 domains bind phosphotyrosine
containing proteins and peptides with high specificity and many play a key
role in regulatory processes (for example, those in protein kinases). In spite
of several solved 3D structures, the full extent of the specificity pocket is
not yet known (ref. 14).

Analysing SH2 domains from kinases, phospholipases and spectrins using the
sequence space approach, we find that the first two principal axes are the
most informative. (The alignment of 52 SH2 sequences, created with the
multiple alignment program Maxhom (ref. 15), can be obtained from the authors
on request.) As before, the first principal axis describes the consensus of
all sequences, while the second axis distinguishes Src-type receptor kinase
sequences (Fig. 2a), a group known to have the same sequence specificity for
phosphotyrosine peptides (ref. 16). In addition to the completely conserved
residues, those on an edge extending from the completely conserved cluster
(top in Fig. 2b) to the cluster of Src-specific residues (left in Fig. 2b) are
predicted to be functionally important. To prevent overprediction, we exclude
from consideration the cluster of Src-specific residues, as all sequences in
this cluster have high pairwise sequence similarity.

Mapping the predicted residues onto the known crystal structure of the Src-SH2
low affinity phosphotyrosyl peptide complex (ref. 17) confirms the accuracy of
the prediction for some of the residues and leaves others as a genuine
prediction. Completely conserved residues either participate in a small
hydrophobic core that stabilises the SH2 fold or map to the binding pocket for
phosphotyrosine. The predicted thirteen specific residues mostly cover a
nearby cleft. Remarkably, six of these overlap with residues known to make
specific contacts with the bound peptide (ref. 18), only one of which is not
identified. Seven more are predicted to be involved in contacts providing
extra specificity on larger peptides. Residues identified in the complex and
those predicted here (Fig. 2c) define the specific binding site. This
hypothesis can be tested by mutagenesis experiments of binding-site residues
predicted to alter peptide binding or change specificity.



Functional residues in cyclins 

The predictive power of the method can be tested most stringently in a case
where no 3D structure is known. Cyclins are involved in control of cell cycle
progression. Different cyclin types specifically control entry to different
cell cycle states. Sequence space analysis of the cyclin box, a homologous
region in all cyclins (110 residues, 49 sequences), identifies, as before, a
set of completely conserved residues, a set of residues functionally specific
for B-type cyclins, and a set of residues on a connecting edge. The latter set
is predicted to be G2/M specific, as they are exclusively found in cyclins
involved in control of transition from G2 phase to mitosis (B-type and B-like
cyclins from plants and fungi). We predict that introducing these residues
into a different cyclin box (for example, that of cyclin A) should be
sufficient to switch this domain to G2/M specificity.

Fig. 4 Representation of two protein sequences as vectors in sequence space.
a, Illustration of the translation of two sequences k and k' (top) into
tables, with entries for each residue type at each position (profiles). The
number 1 is entered for the residue type that actually occurs at a particular
position in the sequence, the number 0 for all other residue types.
Rearranging the entries of the table results in sequence vectors as shown in
(b). These vectors define the location of the corresponding sequences in
sequence space. b, The number of identities $C_{kk'}$ between sequences k and
k' can be calculated as the vector product of the sequence vectors $F_k$ and
$F_{k'}$. A box highlights an identity between sequences k and k'
corresponding to their common G residue at position 66.

The prediction can be refined somewhat by mapping residues onto the predicted
secondary structure (Fig. 3). Completely conserved residues are likely to be
involved in forming the structural core (hydrophobic residues near the centres
of the predicted helices) or involved in crucial functional interactions
(hydrophilic residues in loops or helix ends), possibly with cdc2. G2/M
specific residues in the predicted loops and helix ends are predicted to form
a separate surface patch that forms interactions to specifically control the
G2/M transition.

Scope and utility 

The sequence space method developed here predicts functional residues by
exploiting evolutionary information in a set of related sequences: the 'fossil
record' of evolution under selective pressure. As shown in the control
examples, the new method picks up subtle patterns of conservation and is
capable of accurate predictions. The predictions have two levels. On the first
level, residues identified as completely conserved and residues identified as
subfamily specific are predicted to be involved in some (initially
unspecified) aspect of function. The implication for experiment is that point
mutations in these residues are predicted to lead to a strong phenotype. On
the second level, in favourable cases, particular functions can be assigned
(in the predictive sense) to particular residues. Such more detailed
predictions require some detailed knowledge of the biological functions common
to all proteins in the family (these are assigned to the completely conserved
residues) and/or of the functions specific to proteins in particular
subfamilies (these are assigned to the residues specific for each subfamily).
In the control example, residues common to the entire Ras-Rab-Rho family would
be predicted to be involved in nucleotide binding, while Ras specific residues
would be predicted to be involved in interactions with specific GTPase
activating proteins, nucleotide exchange factors, or downstream effectors. For
the SH2 domains of protein kinases, for which the most conserved residues are
known to be involved in phosphotyrosine binding, the subfamily-specific
residues are predicted to be involved in sequence-specific binding of the
flanking peptide.

As with any predictive method, there is a margin of error in the predictions.
The precise level of error will become apparent as the method is used in
practice to plan specific experiments. The likelihood of error is higher when
sequence information is sparse. The method works best when ample sequences
well dispersed in the subfamilies are available to triangulate sequence space.
In simple cases, the results of the method agree with intuitive analysis of
conserved residues by inspection of multiple alignments ('sequence gazing'),
in the identification of completely conserved residues. However, the method
may be superior to sequence gazing in the identification of more subtle
sequence patterns that are difficult to pick out by eye and require labour
intensive inspection of both multiple sequence alignments and family trees.



The analytical and predictive power of the method 
stems from the introduction of a conceptually novel view 
of sequence diversity, with complementary representa- 
tion of both proteins and residues in the same math- 
ematical space. The practical advantage of the method 
becomes apparent when a large sequence family is 
analysed and/or when conservation patterns are subtle. 
We anticipate that the sequence space method will be 
useful to molecular biologists as a tool to aid prediction 
and classification of functional residues and for plan- 
ning targeted residue-specific functional experiments. 

Methods 

The sequence vectors $F_k$ used in this approach are analogous to
conventional sequence profiles derived from multiple alignments. Just as
profiles give a tabular summary of the amino acid content at each position in
an alignment, a sequence vector consists of 1s and 0s, depending only on
whether a particular residue type is present at a sequence position or not
(Fig. 4). When rearranged as a row vector $F_k$, the $k$th protein sequence
corresponds to a point in $20l$-dimensional space, where $l$ is the length of
the sequence alignment and $k$ is the index for the protein. The alignment $F$
of $n$ sequences is a matrix of $n$ rows of length $20l$, each row holding the
components of a single sequence vector.
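
The encoding just described can be sketched as follows. This is an
illustrative Python fragment, not the authors' program: the residue ordering
in `AMINO_ACIDS`, the function names, the example sequences, and the decision
to leave all 20 entries at 0 for gaps or unknown characters are assumptions.

```python
from typing import List

# Assumed ordering of the 20 residue types; any fixed ordering would serve.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_vector(aligned_seq: str) -> List[int]:
    """Encode an aligned sequence of length l as a 0/1 vector of length 20*l."""
    vec = [0] * (20 * len(aligned_seq))
    for pos, res in enumerate(aligned_seq):
        idx = AMINO_ACIDS.find(res)
        if idx >= 0:  # gaps and unknown characters stay all-zero (assumption)
            vec[20 * pos + idx] = 1
    return vec


def inner_product(u: List[int], v: List[int]) -> int:
    """Inner product of two sequence vectors, which counts identical positions."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical aligned fragments differing at one position.
f_k = sequence_vector("GAKT")
f_k2 = sequence_vector("GSKT")
n_identities = inner_product(f_k, f_k2)
```

Because each position contributes exactly one 1 per sequence, the inner
product of two such vectors is the number of aligned positions at which the
two sequences carry the same residue type.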



The number of identities $C_{kk'}$ between sequences $k$ and $k'$ can be
expressed as the inner product of the sequence vectors (Fig. 4b):

$$C_{kk'} = F_k \cdot F_{k'}$$

A comparison matrix $C$ with the number of identities for all pairs of
sequences can thus be expressed as the matrix product between alignment $F$
and its transpose $F^T$:

$$C = F F^T$$
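
As a worked illustration of the matrix product, the Python sketch below builds
the comparison matrix for three short hypothetical sequences. The helper names
(`encode`, `comparison_matrix`), the residue ordering and the toy alignment
are assumptions made for the example.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering


def encode(aligned_seq: str) -> List[int]:
    """One row of the alignment matrix F: a 0/1 vector of length 20*l."""
    vec = [0] * (20 * len(aligned_seq))
    for pos, res in enumerate(aligned_seq):
        idx = AMINO_ACIDS.find(res)
        if idx >= 0:
            vec[20 * pos + idx] = 1
    return vec


def comparison_matrix(alignment: List[str]) -> List[List[int]]:
    """Compute C = F F^T, so that C[k][k2] is the number of identities
    between sequences k and k2."""
    F = [encode(seq) for seq in alignment]
    return [[sum(a * b for a, b in zip(row_k, row_k2)) for row_k2 in F]
            for row_k in F]


# Hypothetical toy alignment of three sequences of length 4.
C = comparison_matrix(["GAKT", "GSKT", "GAKS"])
```

The diagonal of `C` holds the number of non-gap positions of each sequence
and the matrix is symmetric, as expected for a product of the form
$F F^T$.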

All possible sequences of length up to $l$ residues can be represented as
points in the $20l$-dimensional sequence space. Members of a protein family
typically populate only a small region of this space, as they are similar in
both length and sequence. As a result, the main features of a family can be
described in a subspace of a smaller number of dimensions. The subspace most
suitable for the description of a particular family can be found by solving an
eigenvalue problem: the principal axes defining the subspace are the
eigenvectors corresponding to the largest eigenvalues $\lambda$ of the
comparison matrix $C$ (ref. 3). The relations $C u_p = \lambda_p u_p$ define
the principal axes $u_p$, and the coordinate $x_k^p$ of protein $k$ in
dimension $p$ is $x_k^p = \sqrt{\lambda_p}\, u_k^p$. Representation of the
family members in a lower-dimensional subspace, for example, as points in a
two-dimensional graph, illustrates the main similarity relationships (Fig. 1).
Subfamilies are revealed as clusters, analogous to major branches in a tree
representation. Higher dimensions, in order of decreasing eigenvalues,
describe increasingly finer details in the similarity relationships, analogous
to sub-branches in a tree.
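
The eigenvalue step can be illustrated with a small, dependency-free Python
sketch. Power iteration with deflation is used here purely for illustration
and is an assumption; the source does not say which eigensolver the authors
used. The function names, the starting vector and the toy matrix are likewise
hypothetical, and the sketch presumes the leading eigenvalues are distinct and
positive.

```python
import math
from typing import List, Tuple


def matvec(M: List[List[float]], v: List[float]) -> List[float]:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def normalise(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def principal_axes(C: List[List[float]], n_axes: int = 2,
                   n_iter: int = 500) -> List[Tuple[float, List[float]]]:
    """Leading (eigenvalue, eigenvector) pairs of a symmetric matrix C,
    found by power iteration followed by deflation."""
    n = len(C)
    work = [list(map(float, row)) for row in C]
    axes = []
    for _ in range(n_axes):
        # Fixed, non-uniform start vector (assumed not orthogonal to the target).
        u = normalise([1.0 / (i + 1) for i in range(n)])
        for _ in range(n_iter):
            u = normalise(matvec(work, u))
        lam = sum(a * b for a, b in zip(u, matvec(work, u)))  # Rayleigh quotient
        axes.append((lam, u))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= lam * u[i] * u[j]
    return axes


def protein_coordinates(axes: List[Tuple[float, List[float]]]) -> List[List[float]]:
    """Coordinate of protein k in dimension p: sqrt(lambda_p) * u_p[k]."""
    n = len(axes[0][1])
    return [[math.sqrt(lam) * u[k] for lam, u in axes] for k in range(n)]


# Toy comparison matrix for three hypothetical sequences.
C = [[4, 3, 3], [3, 4, 2], [3, 2, 4]]
axes = principal_axes(C)
coords = protein_coordinates(axes)
```

For this toy matrix the first axis has all components of the same sign, in
keeping with its interpretation as a consensus direction, while the second
axis separates the second and third sequences, which receive equal and
opposite coordinates.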

The conceptual advantage of the novel vector representation of sequences
becomes evident when vectors of individual residues (columns in alignment
$F$), rather than those of entire sequences (rows in $F$), are projected along
the principal directions. The principal axes defined as sequence patterns
$v_p$ with components $v^p_{ri}$ can be obtained as
$F^T u_p \lambda_p^{-1/2} = v_p$, where $u_p$ is an eigenvector and
$\lambda_p$ the corresponding eigenvalue of comparison matrix $C$. Coordinates
$y^p_{ri}$ of residue $r$ at position $i$ in the sequence are [...] (ref. 3).
In this way individual residues can be placed in the same space as the
full-length protein sequences. This unified representation emphasizes links
between sequence subfamilies and their corresponding characteristic residues.
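
The relation $F^T u_p \lambda_p^{-1/2} = v_p$ can be checked on a case small
enough to solve by hand. In the Python sketch below, two hypothetical
sequences "GA" and "GS" give the comparison matrix [[2, 1], [1, 2]], whose
eigenpairs (3 with (1, 1)/√2, and 1 with (1, -1)/√2) are entered directly.
The helper names and residue ordering are assumptions, and only the pattern
components $v^p_{ri}$ are computed; any further scaling into plotted residue
coordinates is not attempted here.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering


def encode(aligned_seq: str) -> List[int]:
    """One row of F: a 0/1 vector of length 20*l."""
    vec = [0] * (20 * len(aligned_seq))
    for pos, res in enumerate(aligned_seq):
        idx = AMINO_ACIDS.find(res)
        if idx >= 0:
            vec[20 * pos + idx] = 1
    return vec


def sequence_pattern(F: List[List[int]], u: List[float], lam: float) -> List[float]:
    """v = F^T u / sqrt(lam): one component per (position, residue type)."""
    n_cols = len(F[0])
    return [sum(F[k][c] * u[k] for k in range(len(F))) / math.sqrt(lam)
            for c in range(n_cols)]


def residue_component(v: List[float], position: int, residue: str) -> float:
    """Component of pattern v for a residue type at a 0-based position."""
    return v[20 * position + AMINO_ACIDS.index(residue)]


F = [encode("GA"), encode("GS")]
s = 1.0 / math.sqrt(2.0)
v1 = sequence_pattern(F, [s, s], 3.0)    # consensus-like direction
v2 = sequence_pattern(F, [s, -s], 1.0)   # direction separating the two sequences
```

The conserved G has the largest component along `v1` and none along `v2`,
whereas A and S have opposite signs along `v2`: conserved residues lie far out
along the consensus direction and sequence-specific residues along the
discriminating one, matching the geometric picture described in the text.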

The algorithm has been implemented in a computer program that reads a multiple
alignment of sequences (alignment formats MSF or HSSP (ref. 5)) and represents
proteins as well as individual residues as points in two-dimensional subspaces
of sequence space (Fig. 1). The axes x1, x2, x3 of the graphs correspond to
the first few principal axes. Different choices of axes, x1 and x2 in Fig.
1a,b or x2 and x3 in Fig. 1c,d, bring out different specificity aspects. When
coupled to an interactive graphics program, the user can explore sequence
families and their specific residue patterns and can identify likely
functional residues.

The executable computer program called Sequence Space and a graphics program
to view the results called Scatter will be made available for academic users
by Internet from ftp.EMBL-Heidelberg.de and www.EMBL-Heidelberg.de.